Estimation of the applicability domain of kernel-based machine learning models for virtual screening
Abstract
BACKGROUND The virtual screening of large compound databases is an important application of structure-activity relationship models. Because these data sets are structurally highly diverse, machine learning based QSAR models, which rely on a specific training set, cannot give reliable results for all compounds. Thus, it is important to consider the subset of the chemical space in which the model is applicable. The approaches to this problem published so far mostly use vectorial descriptor representations to define this domain of applicability of the model. Unfortunately, these cannot be extended easily to structured kernel-based machine learning models. For this reason, we propose three approaches to estimate the domain of applicability of a kernel-based QSAR model.
RESULTS We evaluated three kernel-based applicability domain estimations using three different structured kernels on three virtual screening tasks. Each experiment consisted of training a kernel-based QSAR model using support vector regression and ranking a disjoint screening data set according to the predicted activity. For each prediction, the applicability of the model to the respective compound was described quantitatively by a score obtained from an applicability domain formulation. The suitability of each applicability domain estimation was evaluated by comparing the model performance on the subsets of the screening data set obtained with different thresholds on the applicability scores. This comparison indicates that it is possible to separate the part of the chemical space in which the model gives reliable predictions from the part consisting of structures too dissimilar to the training set for the model to be applied successfully. A closer inspection reveals that the virtual screening performance of the model improves considerably if the half of the molecules with the lowest applicability scores is omitted from the screening.
CONCLUSION The proposed applicability domain formulations for kernel-based QSAR models can successfully identify compounds for which no reliable predictions can be expected from the model. The resulting reduction of the search space and the elimination of some of the active compounds should not be considered a drawback, because the results indicate that, in most cases, the model would not have found these omitted ligands anyway.
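The thresholding step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the three applicability domain formulations, so the score used here (mean normalized kernel similarity to the nearest training compounds) is an assumption chosen for illustration, not the authors' method. The names `toy_kernel`, `applicability_score`, and `filter_by_applicability` are hypothetical, and the support vector regression training and activity ranking are omitted.

```python
import math


def toy_kernel(a, b):
    """Hypothetical stand-in for a structured kernel: dot product of feature-count dicts."""
    return sum(count * b.get(feature, 0) for feature, count in a.items())


def normalized_kernel(k, x, y):
    """Cosine-normalize a kernel so that every item has self-similarity 1."""
    return k(x, y) / math.sqrt(k(x, x) * k(y, y))


def applicability_score(k, train, x, n_neighbors=3):
    """Assumed score: mean normalized similarity of x to its most similar training items."""
    sims = sorted((normalized_kernel(k, t, x) for t in train), reverse=True)
    top = sims[:n_neighbors]
    return sum(top) / len(top)


def filter_by_applicability(k, train, screening, keep_fraction=0.5, n_neighbors=3):
    """Keep the fraction of screening items with the highest applicability scores."""
    ranked = sorted(
        screening,
        key=lambda x: applicability_score(k, train, x, n_neighbors),
        reverse=True,
    )
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Toy data: each "compound" is a dict of feature counts (purely illustrative).
train = [{"C": 2, "N": 1}, {"C": 3}]
screening = [{"O": 1}, {"C": 2, "N": 1}, {"C": 1}, {"S": 2}]

# Omit the half of the screening set with the lowest applicability scores.
kept = filter_by_applicability(toy_kernel, train, screening, keep_fraction=0.5, n_neighbors=1)
```

Setting `keep_fraction=0.5` mirrors the observation in the abstract that omitting the half of the molecules with the lowest applicability scores improved screening performance. In practice the threshold would be chosen per data set.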